Wine Appreciation by The Numbers

by Anna Signor

What is “good” wine? This is a captivating subject, since so much real economic value rides on judgements that many consider elusive.

Join me in exploring data pertaining to the analysis of over 6,000 instances of Portuguese Vinho Verde, juxtaposing objective measurements with the quality ratings of experts. Can we decode what makes an expert give a wine a high quality mark? Are they biased toward reds or whites? Are “better” wines less sweet? Let’s find out.

(I recommend pairing this read with a glass of your favorite wine.)

Sample the data

Let’s get to know our data set. You can find the complete information furnished by the publishers here. In their words:

…two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

I merged the two sets and added a column called type to indicate red or white, so we have the columns:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"

You can refer to the link above for detailed definitions of each field.
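The merge itself is simple; here is a sketch of how it could be done (the CSV file names are assumptions based on the published data set):

```r
# Load the two published CSVs (semicolon-separated)
reds   <- read.csv('winequality-red.csv',   sep = ';')
whites <- read.csv('winequality-white.csv', sep = ';')

# Tag each set with its type, then stack them into one data frame
reds$type   <- 'red'
whites$type <- 'white'
wines <- rbind(reds, whites)
```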

Here are the variables with the units of measurement and their vernacular names, quoted from the original published article:


1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3)

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

Output variable (based on sensory data):

12 - quality (score between 0 and 10)


Univariate Plots Section

The first thing I’d like to know is if I have to account for a bias toward red or white in terms of quality. Let’s look at their summaries.

Whites:

whites <- filter(wines, type == 'white')
reds <- filter(wines, type == 'red')
summary(whites$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.852   6.000   9.000

Reds:

summary(reds$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The distribution statistics are very similar; only the mean for whites is slightly higher. How do they look side by side?

It is interesting to see the full distributions side by side. While the difference between the means looks small and the medians are the same, the shapes of the distributions are different, and it looks like white wines score better. I should also point out that an identical median carries reduced meaning when the variable being measured is ordinal.
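A sketch of how the side-by-side view could be produced (assuming ggplot2):

```r
library(ggplot2)

# Quality histograms for reds and whites in separate panels
ggplot(wines, aes(x = quality)) +
  geom_bar() +
  facet_wrap(~ type)
```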

And here is the combined distribution:

Separately or combined, these are normal-like distributions. In the back of my head, I am putting a pin in this: if I try to write a predictor for quality, I need to be careful. We already know this distribution, so any valuable predictor needs to be better than just guessing around the mean or the median, because guessing could produce deceptively good results without actually adding value over a simple study of the distribution.

I wonder which measured attributes do not behave that way.

Some distributions, like chlorides and residual sugar, are so extremely skewed that it is hard to get a sense of their shape. Let’s look at the same plot with a log scale.
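The log-scale version only requires swapping the x-axis scale; for example, for residual sugar:

```r
# A log10 x-axis spreads out the heavily right-skewed sugar values
ggplot(whites, aes(x = residual.sugar)) +
  geom_histogram(bins = 30) +
  scale_x_log10()
```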

This is how it looks for whites:

And with the same consideration regarding the extreme distributions:

And the same two plots for reds

I want to see the two log10 grids next to one another and compare:

Univariate Analysis

From this one-dimensional analysis I learned a lot. Firstly, the quality grades follow a normal-like distribution, so I will need to be very strict before celebrating the accuracy of any predictive model.

Another important lesson is to note the skewed distributions of some of the variables, like residual sugar. Moving to more than one variable, this analysis may benefit from axis transformations.

It is very hard to get any answers without exploring the relationships between the different features and their relationships to the quality. We do that next.


Bivariate Plots Section

One feature that stood out to me was the volatile acidity. This is how I understood it from the documentation: volatile acidity is the kind that goes away in time after opening a bottle of wine, versus the actual acidity that “belongs” in the wine. I always heard that the best red wines don’t need to breathe: you can drink them as soon as the bottle is opened. (And I never heard of white wines needing to breathe at all.) I wonder if that is related to the volatile acidity, and if we can see that in the data.

Let’s first plot the volatile acidity against the quality.
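Since quality is ordinal, one natural way to draw this (a sketch, assuming ggplot2) is a boxplot of volatile acidity per quality grade:

```r
# Volatile acidity by quality grade, red wines only
ggplot(reds, aes(x = factor(quality), y = volatile.acidity)) +
  geom_boxplot()
summary(reds$volatile.acidity)
```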

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800


There seems to be a trend where the quality increases with the decrease in volatile acidity. It seems they may be inversely correlated.

What happens if we include the whites? I would expect the relationship to be different than with the reds, because I never heard of white wine having to breathe.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2759  0.3200  1.0050

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

In whites, the correlation is much weaker, or at least less evident. Also, the volatile acidity is consistently lower than in reds: in red wines the range is 0.12 - 1.58 g/dm^3, while in whites it is 0.08 - 1.00 g/dm^3. It also seems that, within each type, the low quality wines have the highest volatile acidity.

A popular conception is that sweet wines are “bad”. Let’s take a stab at that one.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   2.400   4.462   5.900  65.800

It looks like we had better eliminate the outliers at the top to properly see the scales here.


It is still hard to see. Let’s eliminate the entire upper quartile.
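Trimming to below the third quartile can be done with a simple filter (a sketch; assuming dplyr and ggplot2 are loaded):

```r
# Keep only wines below the 75th percentile of residual sugar
sugar_q3 <- quantile(wines$residual.sugar, 0.75)
ggplot(filter(wines, residual.sugar < sugar_q3),
       aes(x = factor(quality), y = residual.sugar, fill = type)) +
  geom_boxplot()
```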


I’d really be surprised if there were any correlation here. At least for Vinho Verde, this myth is bust. All we can say is that white wines in this family are consistently higher in sugar content, which surprised me a lot.

Let’s look at the impact of free sulfur dioxide on quality:

The biggest piece of information is the difference between reds and whites. It seems like, at least by itself, the free sulfur dioxide content has no meaningful relationship with the quality.

Volatile acidity vs fixed acidity:

I see the possibility of a correlation, but only in reds. Let’s look at just them:

There are some signs pointing to a correlation, more visible now. I’d also guess that it is conditioned on other variables, which may be responsible for the outliers. I am just going to take some guesses at what they could be; maybe I’ll get lucky. So I am plotting just the red wine data, colored by each of my lurking-variable candidates. If the outliers tend to appear in a color that stands out, that is likely a lurking variable in this correlation. In other words, in the plots to come, what we will be looking for is a lot of the same color either far from or close to the trend line.

Citric Acid:

I don’t necessarily see a color pattern playing off the trendline, but this variable seems correlated with the fixed acidity. Let’s check:

Nice. They look correlated.

Sugar content:

Sulphates:

Alcohol:

This one looks more interesting. You can kind of see a lot of green away from the trendline, but not a jackpot.

Mmm.. Log on x?

Meh.. nothing to see here.

Chlorides:

This one also looks more interesting: you can kind of see a lot of green away from the trendline, but not a jackpot, and there is green kind of everywhere. Let’s try:

Nothing of interest here.

Density:

This one just looks well correlated with the fixed acidity. Let’s look at that:


Bingo!

Moving on, I’d like to see if there is an apparent correlation between alcohol content and quality:

Now, this is interesting. This seems to be a very relevant factor.

Let’s look at each variable besides the quality and type plotted against quality as a smooth trendline. This should give us a good high-level view of the correlations between the variables and their contributions to the quality feature.

The dots are really just adding clutter here. I really just want to explore the possible nature of each relationship, not its strength just yet. So let’s look at the same plot as above, without the jittered points.
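A sketch of the trendline-only grid (assuming ggplot2 and tidyr):

```r
library(tidyr)

# Long format lets all eleven features share one faceted plot;
# geom_smooth alone draws a smoothed trend without the point clutter
wines_long <- pivot_longer(wines, cols = fixed.acidity:alcohol,
                           names_to = 'feature', values_to = 'value')
ggplot(wines_long, aes(x = value, y = quality, color = type)) +
  geom_smooth() +
  facet_wrap(~ feature, scales = 'free_x')
```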

The first thing that stands out to me, looking at this, is how some of the relationships are very different for reds and whites. In some cases, the relationships are “opposing”, as with sulphates and total sulfur dioxide. In other cases a feature is relevant in one type and not in the other, as with free sulfur dioxide. Furthermore, the error bands are much wider for the white wines, indicating I may find stronger relationships in the red wine set.

For this reason, this exploration will proceed for red wines only. (I understand now why the publishers had this as two separate data sets.)

Now that this is cleaner, on a second look, some pairs of features stand out as having possible similar relationships with the quality, which means that they could be correlated. Of note:

As far as the strength of the relationship with quality goes, it seems that alcohol, volatile acidity and density have the strongest relationships overall; note the reduced width of their error bands.

Bivariate Analysis

The first guess I took, based on the fact that red wines have to “breathe” in general, but not the best wines, is an inverse relationship between the volatile acidity and the quality. We found the relationship is definitely there, for red wines. In further accordance with expectations, the relationship is different with white wines and the same behavior is not observed.

I tried to explore another piece of popular wisdom, or at least my operational definition thereof, that sweet wines are not good quality, by looking for an inverse correlation between residual sugar and quality. I was unable to find any such relationship.

My next exploration was looking at the relationship between volatile and fixed acidity, and looking for possible variables that may be correlated with either one. By doing this, I found another correlated pair: density and acidity.

Concluding the bivariate exploration, I took a general look at all the relationships between each feature and the quality. I found a “separation” between red and white wines in terms of those relationships. For this reason, coupled with my own personal preference, I will proceed with an exploration for red wines only. Looking at the grid of scatterplots and regression lines, some pairs of features stand out as possible candidates for intercorrelation. In the next section, I will explore those.


Multivariate Plots Section

Let’s start by looking at the last plot from the last section, showing the relationships between each feature and the quality:

It looks like a lot of these features are strong and discerning. Because of this, I will try a decision tree. Decision trees work best when the predicted value is categorical or ordered. So, let’s call a wine “ge” for “good or excellent” when the quality grade is 6 or more, and “mb” for “mediocre or bad” otherwise.
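The transformation code is not shown in this writeup; a minimal sketch could be:

```r
# Label wines with quality >= 6 as "ge" (good/excellent), else "mb";
# keep both a character column and a factor version for modeling
reds$grade   <- ifelse(reds$quality >= 6, 'ge', 'mb')
reds$grade_f <- factor(reds$grade)
head(rev(reds), 5)  # rev() just reverses column order for display
```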

##   grade_f grade type quality alcohol sulphates   pH density
## 1      mb    mb  red       5     9.4      0.56 3.51  0.9978
## 2      mb    mb  red       5     9.8      0.68 3.20  0.9968
## 3      mb    mb  red       5     9.8      0.65 3.26  0.9970
## 4      ge    ge  red       6     9.8      0.58 3.16  0.9980
## 5      mb    mb  red       5     9.4      0.56 3.51  0.9978
##   total.sulfur.dioxide free.sulfur.dioxide chlorides residual.sugar
## 1                   34                  11     0.076            1.9
## 2                   67                  25     0.098            2.6
## 3                   54                  15     0.092            2.3
## 4                   60                  17     0.075            1.9
## 5                   34                  11     0.076            1.9
##   citric.acid volatile.acidity fixed.acidity
## 1        0.00             0.70           7.4
## 2        0.00             0.88           7.8
## 3        0.04             0.76           7.8
## 4        0.56             0.28          11.2
## 5        0.00             0.70           7.4

As you can see, we now have a grade, which can only assume two values, and a factor column corresponding to it. Now, let’s split the data into train and test sets, at a 75% rate, and take a peek at the train set just to make sure everything looks normal.
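A sketch of the split (the seed value here is an arbitrary assumption, just for reproducibility):

```r
# Sample 75% of the row indices for training; the rest become the test set
set.seed(123)
in_train <- sample(nrow(reds), size = floor(0.75 * nrow(reds)))
train <- reds[in_train, ]
test  <- reds[-in_train, ]
head(rev(train), 5)
```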

##   grade_f grade type quality alcohol sulphates   pH density
## 1      mb    mb  red       5     9.4      0.56 3.51  0.9978
## 2      mb    mb  red       5     9.8      0.68 3.20  0.9968
## 3      mb    mb  red       5     9.8      0.65 3.26  0.9970
## 4      ge    ge  red       6     9.8      0.58 3.16  0.9980
## 5      mb    mb  red       5     9.4      0.56 3.51  0.9978
##   total.sulfur.dioxide free.sulfur.dioxide chlorides residual.sugar
## 1                   34                  11     0.076            1.9
## 2                   67                  25     0.098            2.6
## 3                   54                  15     0.092            2.3
## 4                   60                  17     0.075            1.9
## 5                   34                  11     0.076            1.9
##   citric.acid volatile.acidity fixed.acidity
## 1        0.00             0.70           7.4
## 2        0.00             0.88           7.8
## 3        0.04             0.76           7.8
## 4        0.56             0.28          11.2
## 5        0.00             0.70           7.4

Let’s look at a representation of a decision tree using only the features volatile acidity and sulphates content. Here, I made a deliberate decision to use features that show strong correlations, and to use one positively and one negatively correlated (or apparently so), to give us the best chance of seeing a decent tree. Using two features is in no way the best modeling strategy, but it yields a good graphical representation of how decision trees work. We will eventually include more features, and that will not be palatable as a graph.
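The fitting code is not shown above; a sketch of how such a tree could be built (assuming the rpart and rpart.plot packages):

```r
library(rpart)
library(rpart.plot)  # for the tree diagram

# Classification tree on just the two hand-picked features
tree2 <- rpart(grade_f ~ volatile.acidity + sulphates,
               data = train, method = 'class')
rpart.plot(tree2)
```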

To understand how the decision tree is working, let’s consider an example, from the test set.

rev(test[8, ])
##    grade_f grade type quality alcohol sulphates   pH density
## 50      mb    mb  red       5     9.2      0.58 3.32  0.9954
##    total.sulfur.dioxide free.sulfur.dioxide chlorides residual.sugar
## 50                   96                  12     0.074            1.4
##    citric.acid volatile.acidity fixed.acidity
## 50        0.37             0.31           5.6

This particular data point is a wine with a volatile acidity of 0.31 g/dm^3, sulphates at 0.58 g/dm^3, and a quality rating of 5, which makes it a “mediocre or bad” wine per our definition of grade. The diagram below shows how the tree above would predict this wine’s quality.

Picking just two features is a simplistic way to proceed (although I exercised judgement in picking them), but just out of curiosity, let’s see what accuracy we get from it, by testing the model on data that was not used to build the tree, that is, the test set.

This does not look the greatest. What we are looking for is a segregation of yellow and green. Let’s check the confusion matrix for this:

## [1] 0.6675
##     
##       ge  mb
##   ge 356  72
##   mb 194 178
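The accuracy and confusion matrix above can be computed along these lines (assuming the fitted model is named tree2):

```r
# Classify the held-out wines, then compare predictions to the truth
pred <- predict(tree2, newdata = test, type = 'class')
print(mean(pred == test$grade_f))   # accuracy
table(pred, test$grade_f)           # confusion matrix
```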

Now, let’s build a tree with all the features, which will likely perform better, but will not look good plotted.

Confusion matrix and accuracy:

## [1] 0.74
##     
##       ge  mb
##   ge 312 116
##   mb  92 280

…and the accuracy is 74%, not bad for an out-of-the-box tree with little to no feature engineering.

This is a fine way to see our classifier perform, but let’s plot something a tad more interesting. To get a more comprehensive qualitative perception of the classifier’s behavior, we can facet wrap the same plot above, with alcohol vs each of the other features, since quality and grade are informationally redundant with the prediction.

Above, the points are colored yellow when they are predicted to be bad or mediocre, and green if they are predicted to be good or excellent. In addition, they are triangles if they are actually mb and circles if they are ge. So, to gauge accuracy you’d have to look for a scarcity of green triangles and yellow circles, which is hard to see, but this is not the point of this chart. We know the accuracy from the confusion matrix; that’s 74% and will not change no matter how we look at it.

What I am trying to visualize here is the tree’s behavior in making classifications. Specifically, we are looking at the decision boundary in 10 of the 55 possible pairwise views of it. In this case, the boundary is a 10-dimensional surface embedded in the 11-dimensional feature space. Each cell in the facet wrap above is its projection onto a different 2-d space spanned by alcohol and one of the remaining features, represented by the line that separates the yellow points from the green ones.

This type of visualization can give one insight into how to tune the tree, engineer features, or even move to a different model. For example, boundaries with complicated geometries mean your tree is over-fitted; this is a very common issue with decision trees.

This exploration ends here; a further project would involve taking steps to improve the accuracy of the model. My first step would be to move to a random forest classifier or some other sort of ensemble model to counter the over-fitting I see here. Feature transformations as well as PCA also look like good steps to pursue in this case.

Multivariate Analysis

After plotting all the features against the quality, many of them seemed relevant. I decided to create a new feature called grade, which segregates the records into two categories, and to attempt to build a classifier.

As part of this exploration, I built a starter tree with just two features, and it was a surprise to see it yield an accuracy of about 67%. The final tree is an out-of-the-box decision tree trained on 75% of the data and scoring an accuracy of 74%.


Final Plots and Summary

Plot One - separation between reds and whites

The method used to show the trends above is LOESS, which is the same throughout the project, except where a comparison with a linear regression was desired. In this case, due to the number of plot cells and the size and number of data points, the actual dot plot is not shown, as it was more confusing than illuminating. Observe how, for many of the features, the correlational behavior is dramatically different; note in particular pH, sulphate content, and acidity. The exploration of the relationships could proceed in a more focused manner by treating each type separately. There is no objective reason why I chose reds; the main reason was relatability: I can interpret the data in a richer manner since I simply know more facts about red wines than white.

Plot Two - volatile acidity vs quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

A trend becomes very clear here. Higher quality wines are less and less likely to have a high amount of volatile acidity, and none of the wines graded good or excellent have more than 1.1 g/dm^3 of volatile acidity. The maroon trend line above is a regression using the LOESS method. I used LOESS throughout the project to avoid a “linear blinders” effect: if the relationship is not best modeled by a line, I want to be honest about that. In this case, however, what is really interesting is that LOESS gives us something extremely similar to a line. The yellow line is a linear regression; note how much overlap there is. It is only where the distribution becomes sparse that the relationship becomes less linear.

Plot Three - a look at our the predictor’s behavior

The first thing to note here is that this is by no means a finished product, rather the seed for the possible development of a well performing predictor. This is a typical result of a data exploration. (If this were a crime case, this project would be the investigation, not the prosecution.) With that caveat, let’s look at the behavior of the decision tree.

Visualizing the accuracy of the model here would be difficult, as it would essentially consist of looking for yellow circles and green triangles. That is not the intention of this visualization; rather, I am plotting a look at the decision boundary. What we see above are 10 projections of a surface living in the 11-dimensional feature space onto 2-dimensional spaces, each spanned by alcohol and one of the other features. The projection of the boundary is a line that bisects the plane, leaving only yellow on one side and only green on the other.

What I see are complicated geometries, which means my model is over-fitted to the training data. Next steps, for a future project, in pursuit of improving the quality of the predictor, are:

* transition to a random forest classifier or other ensemble method
* exclude outliers
* PCA
* feature transformation

Reflection

This is a very interesting data set because it juxtaposes chemical properties that can be objectively measured with quality scores, which are not only subjective but a controversial subject, and a driver of large-scale business performance as well as micro-economic measures.

I started by taking a look at distributions, and then looking for correlations between the features, and between features and the prediction target. Some strong correlations were found among pairs of features, some surprising and some wholly expected (like citric acid and pH). One interesting side exploration was the finding on volatile acidity, since it has a very relatable interpretation regarding the “breathing” of red wines. It was also at this point (the bivariate exploration) that I decided to continue examining red wines only.

The final exploration branch was what can be described as a “go/no-go” for building a tree-type classifier. In other words, I never meant to finish with a well-performing classifier, but wanted to find out whether the attempt is even worthwhile. The out-of-the-box tree performed at 74%, which is a testament to the quality of this data set. I then did a brief analysis of one facet of the decision boundary and made determinations as to what the next steps could be to improve the model.